Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Ensuring that technological advancements benefit all groups of people equally is crucial. The first step towards fairness is identifying existing inequalities. The naive comparison of group error rates may lead to wrong conclusions. We introduce a new method to determine whether a speaker verification system is fair toward several population subgroups. We propose to model miss and false alarm probabilities as a function of multiple factors, including the population group effects, e.g., male and female, and a series of confounding variables, e.g., speaker effects, language, nationality, etc. This model can estimate error rates related to a group effect without the influence of confounding effects. We experiment with a synthetic dataset where we control group and confounding effects. Our metric achieves significantly lower false positive and false negative rates w.r.t. baseline. We also experiment with VoxCeleb and NIST SRE21 datasets on different ASV systems and present our conclusions.more » « less
-
null (Ed.)The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing a too strong model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable.more » « less
An official website of the United States government
